🧩 NVIDIA Nemotron 3 Nano - How To Run Guide
Run & fine-tune NVIDIA Nemotron 3 Nano locally on your device!
NVIDIA releases Nemotron 3 Nano, a 30B-parameter hybrid reasoning MoE model with ~3.6B active parameters, built for fast, accurate coding, math, and agentic tasks. It has a 1M-token context window and is the best in its size class on SWE-Bench, GPQA Diamond, reasoning, chat, and throughput.
Nemotron 3 Nano runs on 24GB of RAM/VRAM (or unified memory), and you can now fine-tune it locally. Thanks to NVIDIA for providing Unsloth with day-zero support.
NVIDIA Nemotron 3 Nano GGUF to run: unsloth/Nemotron-3-Nano-30B-A3B-GGUF. We also uploaded BF16 and FP8 variants.
⚙️ Usage Guide
NVIDIA recommends these settings for inference:
General chat/instruction (default):
temperature = 1.0, top_p = 1.0
Tool calling use-cases:
temperature = 0.6, top_p = 0.95
For most local use, set:
max_new_tokens = 32,096 to 262,144 for standard prompts, with a max of 1M tokens. Increase this for deep reasoning or long-form generation as your RAM/VRAM allows.
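If you are running the BF16 checkpoint with Transformers rather than a GGUF, a minimal sketch of applying these settings is below. The repo name unsloth/Nemotron-3-Nano-30B-A3B is an assumption (the GGUF repo name minus -GGUF), and this architecture may require a recent transformers release:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "unsloth/Nemotron-3-Nano-30B-A3B"  # assumed BF16 repo name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "What is 2+2?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# General chat defaults; swap to temperature=0.6, top_p=0.95 for tool calling
output = model.generate(
    input_ids,
    max_new_tokens=32_096,
    do_sample=True,
    temperature=1.0,
    top_p=1.0,
)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```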
The chat template format can be seen by running the snippet below:
from transformers import AutoTokenizer

# The BF16 repo name is an assumption, derived from the GGUF repo name above
tokenizer = AutoTokenizer.from_pretrained("unsloth/Nemotron-3-Nano-30B-A3B")

print(tokenizer.apply_chat_template([
    {"role" : "user", "content" : "What is 1+1?"},
    {"role" : "assistant", "content" : "2"},
    {"role" : "user", "content" : "What is 2+2?"}
], add_generation_prompt = True, tokenize = False))

Running this prints the Nemotron 3 chat template format.
🖥️ Run Nemotron-3-Nano-30B-A3B
Depending on your use-case, you will need different settings. Some GGUFs end up similar in size because the model architecture (like gpt-oss) has dimensions that are not divisible by 128, so some parts can't be quantized to lower bits.
Llama.cpp Tutorial (GGUF):
Instructions to run in llama.cpp (note we will be using 4-bit to fit most devices):
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
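A typical build sketch for Linux (the package names are for Debian/Ubuntu; adjust for your distro):

```bash
apt-get update
apt-get install build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first \
    --target llama-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp/
```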
You can directly pull from Hugging Face. You can increase the context to 1M as your RAM/VRAM allows.
Follow this for general instruction use-cases:
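For example (a sketch: the flags are standard llama.cpp options, the sampling values are the recommended general-chat settings above, and UD-Q4_K_XL is one of the uploaded quants):

```bash
./llama.cpp/llama-cli \
    -hf unsloth/Nemotron-3-Nano-30B-A3B-GGUF:UD-Q4_K_XL \
    --jinja \
    --temp 1.0 \
    --top-p 1.0 \
    --ctx-size 262144
```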
Follow this for tool-calling use-cases:
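The same sketch with the recommended tool-calling sampling settings:

```bash
./llama.cpp/llama-cli \
    -hf unsloth/Nemotron-3-Nano-30B-A3B-GGUF:UD-Q4_K_XL \
    --jinja \
    --temp 0.6 \
    --top-p 0.95 \
    --ctx-size 262144
```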
Alternatively, download the model first (after installing pip install huggingface_hub hf_transfer). You can choose UD-Q4_K_XL or other quantized versions.
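A minimal download sketch, assuming the standard huggingface_hub API and the repo/quant names mentioned above:

```python
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # enable faster downloads via hf_transfer

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id = "unsloth/Nemotron-3-Nano-30B-A3B-GGUF",
    local_dir = "unsloth/Nemotron-3-Nano-30B-A3B-GGUF",
    allow_patterns = ["*UD-Q4_K_XL*"],  # change the pattern to grab another quant
)
```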
Then run the model in conversation mode:
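A sketch, assuming the .gguf filename below matches what was downloaded (adjust it if not); recent llama-cli builds default to conversation mode when the model ships a chat template:

```bash
./llama.cpp/llama-cli \
    --model unsloth/Nemotron-3-Nano-30B-A3B-GGUF/Nemotron-3-Nano-30B-A3B-UD-Q4_K_XL.gguf \
    --jinja \
    --temp 1.0 \
    --top-p 1.0 \
    --ctx-size 262144 \
    --n-gpu-layers 99
```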
Also adjust the context window as required, and make sure your hardware can handle anything beyond a 256K context window. Setting it to 1M may trigger a CUDA OOM crash, which is why the default here is 262,144.
Nemotron 3 uses <think> (token ID 12) and </think> (token ID 13) for reasoning. Use --special in llama.cpp to print these tokens. You might also need --verbose-prompt to see <think>, since it is prepended to the prompt.
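For example (a sketch using standard llama-cli flags):

```bash
# Print special tokens such as <think> / </think> and the full prompt
./llama.cpp/llama-cli \
    -hf unsloth/Nemotron-3-Nano-30B-A3B-GGUF:UD-Q4_K_XL \
    --jinja --special --verbose-prompt \
    -p "What is 2+2?"
```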
Because the model was trained with NoPE, you only need to change max_position_embeddings. The model doesn’t use explicit positional embeddings, so YaRN isn’t needed.
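If you are extending the context of the Transformers (BF16) checkpoint, a minimal sketch is below; the config path is an assumption and should point at your local copy of the model:

```python
import json

cfg_path = "Nemotron-3-Nano-30B-A3B/config.json"  # assumed local path to the checkpoint
with open(cfg_path) as f:
    cfg = json.load(f)

# Raise the declared context length; no YaRN / rope_scaling entry is needed with NoPE
cfg["max_position_embeddings"] = 1_048_576  # 1M tokens
with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```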
🦥 Fine-tuning Nemotron 3 Nano and RL
Unsloth now supports fine-tuning of all Nemotron models, including Nemotron 3 Nano. The 30B model does not fit on a free Colab GPU; however, we still made an 80GB A100 Colab notebook for you to fine-tune with. 16-bit LoRA fine-tuning of Nemotron 3 Nano will use around 60GB of VRAM.
On fine-tuning MoEs: it's probably not a good idea to fine-tune the router layer, so we disable it by default. If you want the model to retain its reasoning capabilities (optional), use a mix of direct answers and chain-of-thought examples, with at least 75% reasoning and 25% non-reasoning examples in your dataset.
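As a rough illustration, here is a minimal Unsloth LoRA setup sketch. The model repo name and the target module names are assumptions (this hybrid Mamba/attention/MoE architecture may use different layer names), so treat the Colab notebook as the reference:

```python
from unsloth import FastLanguageModel

# 16-bit LoRA (load_in_4bit = False) needs roughly 60GB of VRAM for this model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Nemotron-3-Nano-30B-A3B",  # assumed repo name
    max_seq_length = 8192,
    load_in_4bit = False,
)

# Standard projection names shown here; router / expert-gating layers are not targeted
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,
    lora_alpha = 32,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
)
```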
✨ Reinforcement Learning + NeMo Gym
We worked with the open-source NVIDIA NeMo Gym team to help democratize RL environments. Our collaboration enables single-turn rollout RL training across many domains of interest, including math, coding, and tool use, using training environments and datasets from NeMo Gym.
Also check out our latest collaboration guide published on NVIDIA's official Developer blog.
🎉 Llama-server serving & deployment
To deploy Nemotron 3 for production, we use llama-server. In a new terminal (for example, via tmux), deploy the model via:
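A deployment sketch using the same recommended sampling settings; the port (8001) is an arbitrary choice and UD-Q4_K_XL is one of the uploaded quants:

```bash
./llama.cpp/llama-server \
    -hf unsloth/Nemotron-3-Nano-30B-A3B-GGUF:UD-Q4_K_XL \
    --jinja \
    --temp 1.0 \
    --top-p 1.0 \
    --ctx-size 262144 \
    --host 0.0.0.0 \
    --port 8001
```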
When you run the above, llama-server will load the model and report that it is listening on the configured port (8001 in the sketch above).
Then in a new terminal, after doing pip install openai, do:
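A minimal client sketch using the standard OpenAI Python SDK pointed at the local server; the API key is a placeholder (llama-server does not require one unless configured), and the model name is informational for a single-model server:

```python
from openai import OpenAI

# Point the client at the local llama-server started above (port 8001)
client = OpenAI(base_url = "http://localhost:8001/v1", api_key = "sk-no-key-needed")

completion = client.chat.completions.create(
    model = "unsloth/Nemotron-3-Nano-30B-A3B-GGUF",
    messages = [{"role": "user", "content": "What is 2+2?"}],
    temperature = 1.0,
    top_p = 1.0,
)
print(completion.choices[0].message.content)
```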
This will print the model's response.
Benchmarks
Nemotron-3-Nano-30B-A3B is the best-performing model in its size class across these benchmarks, including throughput.
